Claude/web crawler python3 modernize n21pe#21

Merged: jmg merged 4 commits into master from claude/web-crawler-python3-modernize-N21pe on Jun 5, 2026

jmg (Owner) commented Jun 5, 2026

No description provided.

claude added 4 commits June 5, 2026 18:23
…itemaps

Closes the main "framework" gaps vs Scrapy, built on the existing async engine
(httpx, retries, rate limiting, robots.txt, de-duplication).

- crawley.spider: `Request` (callback, meta, cb_kwargs, headers, priority,
  dont_filter, errback, fingerprint/replace), `Item`, and a callback-driven
  `Spider` (parse/start_requests/on_item, depth tracking, fingerprint de-dup).
- response.follow()/response.meta and response.request for list->detail crawls.
- crawley.pipelines: `ItemPipeline` + `DropItem`; spiders run items through the
  pipeline chain (open_spider/close_spider/process_item, sync or async).
- crawley.spiders: `LinkExtractor` (allow/deny/restrict_xpaths/restrict_css),
  `Rule`, `CrawlSpider` (rule-based following) and `SitemapSpider`
  (sitemap.xml + sitemap index).
- RequestManager.make_request accepts per-request headers.
- Extractors parse from bytes so XML-with-declaration (sitemaps) is handled.
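
For reviewers unfamiliar with fingerprint-based de-duplication, here is a minimal standalone sketch of the idea behind `fingerprint` and `dont_filter`. It does not use crawley's code: the `fingerprint` function, the `SeenFilter` class, and the canonicalization rules (sorted query string, dropped fragment) are assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash method + canonical URL + body (hypothetical scheme, not crawley's)."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha1()
    for chunk in (method.upper().encode(), canonical.encode(), body):
        digest.update(chunk)
        digest.update(b"\0")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


class SeenFilter:
    """Remembers fingerprints and rejects requests that were already seen."""

    def __init__(self):
        self._seen = set()

    def should_schedule(self, method, url, body=b"", dont_filter=False):
        if dont_filter:
            # Assumption: dont_filter bypasses the filter and records nothing.
            return True
        fp = fingerprint(method, url, body)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

Including the method and body in the hash keeps a `POST` distinct from a `GET` to the same URL.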

Tests (169 -> 180): test_spider, test_spiders; conftest serves /sitemap.xml.
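
The "sync or async" pipeline chain with `DropItem` can be pictured roughly as follows. This is a standalone sketch, not crawley's implementation: `run_pipelines` and the two example stages are hypothetical, and the `process_item(item, spider)` signature is an assumption borrowed from the Scrapy convention.

```python
import asyncio
import inspect


class DropItem(Exception):
    """Raised by a pipeline stage to discard an item."""


class StripWhitespace:
    # A synchronous stage: returns a cleaned copy of the item.
    def process_item(self, item, spider):
        return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


class RequireTitle:
    # An asynchronous stage: drops items that have no title.
    async def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem("missing title")
        return item


async def run_pipelines(item, pipelines, spider=None):
    """Pass the item through each stage; return None if any stage drops it."""
    for stage in pipelines:
        try:
            result = stage.process_item(item, spider)
            if inspect.isawaitable(result):
                result = await result  # support async stages transparently
        except DropItem:
            return None
        item = result
    return item


stages = [StripWhitespace(), RequireTitle()]
kept = asyncio.run(run_pipelines({"title": "  Hello "}, stages))
dropped = asyncio.run(run_pipelines({"title": "   "}, stages))
```

Checking `inspect.isawaitable` on the return value lets one chain mix plain functions and coroutines without two code paths.
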
…ders

- crawley.http.playwright.PlaywrightRequestManager: render pages with a
  headless browser (lazy import), with per-host throttling and retries; wired
  into the engine via `render_js = True` and `playwright_options`. Optional
  extra `crawley[js]`.
- Docs: new "Spiders" page (Request/callbacks/follow, item pipelines,
  CrawlSpider/LinkExtractor, SitemapSpider, JS rendering); API reference and
  nav updated.
- examples/06_spider.py (callback spider + pipeline), indexed and test-covered.
- README "Spiders" section; CHANGELOG updated.

Tests (180 -> 187): test_playwright (render path mocked, no browser needed)
plus the spider example.
- crawley.stats.StatsCollector: per-crawl counters (requests, responses,
  status/<code>, request_errors, robots_blocked, items/items_dropped, elapsed),
  exposed as crawler/spider `stats` and logged on finish.
- crawley.http.cache.HttpCache: on-disk response cache keyed by
  method+url+body. Enable with `http_cache = True` / `http_cache_dir`; wired
  into RequestManager, FastRequestManager and the Playwright manager.
- crawley.spider.FormRequest + FormRequest.from_response(): read a <form>,
  pre-fill inputs/selects/textareas, honour its method (GET -> query string),
  override fields via formdata.

Docs (crawler stats/cache, spiders forms, API reference) and CHANGELOG updated.
Tests (187 -> 201): test_stats, test_cache, test_forms; conftest serves
/login-form.
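
As a rough illustration of a cache "keyed by method+url+body", the sketch below derives a file name from those three fields and stores responses as JSON. The names `cache_key` and `DiskCache`, the SHA-256 hash, and the JSON layout are all assumptions for illustration; the actual `HttpCache` storage format is not described in this PR.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(method: str, url: str, body: bytes = b"") -> str:
    """Derive a stable key from method + url + body (hypothetical scheme)."""
    h = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), body):
        h.update(len(part).to_bytes(8, "big"))  # length prefix avoids ambiguity
        h.update(part)
    return h.hexdigest()


class DiskCache:
    """Stores one JSON file per cached response under a directory."""

    def __init__(self, directory):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, method, url, body):
        return self.dir / (cache_key(method, url, body) + ".json")

    def get(self, method, url, body=b""):
        path = self._path(method, url, body)
        if not path.exists():
            return None  # cache miss
        record = json.loads(path.read_text(encoding="utf-8"))
        return record["status"], bytes.fromhex(record["body_hex"])

    def put(self, method, url, status, content, body=b""):
        record = {"status": status, "body_hex": content.hex()}
        self._path(method, url, body).write_text(json.dumps(record), encoding="utf-8")


cache = DiskCache(tempfile.mkdtemp())
cache.put("GET", "http://example.com/", 200, b"<html>")
```

Because the request body is part of the key, two `POST` requests to the same URL with different form data get separate cache entries.
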
- crawley.middlewares.DownloaderMiddleware: process_request / process_response
  / process_exception chains (sync or async) wrapping every Spider download.
  process_request may short-circuit with a Response or reschedule a Request;
  process_exception can recover from errors.
- crawley.http.autothrottle.AutoThrottle: adapt the per-host delay to the
  observed response latency (target_concurrency, start/max delay). Enable with
  `autothrottle = True`; Response now carries `.latency` (httpx elapsed /
  measured render time), fed to the per-host rate limiter.
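
The middleware behaviour described above (short-circuiting with a Response, recovering in `process_exception`) can be sketched in isolation. Everything here is hypothetical: the `Request` and `Response` dataclasses, the `download` function, and the hook signatures stand in for crawley's classes, running `process_response` in reverse order is an assumption borrowed from Scrapy, and rescheduling a Request is left out for brevity.

```python
import asyncio
import inspect
from dataclasses import dataclass


@dataclass
class Request:
    url: str


@dataclass
class Response:
    url: str
    status: int
    body: bytes = b""


async def _maybe_await(value):
    # Accept both sync and async hooks.
    return await value if inspect.isawaitable(value) else value


async def _first_response(middlewares, hook_name, *args):
    """Call the named hook on each middleware; return the first Response."""
    for mw in middlewares:
        hook = getattr(mw, hook_name, None)
        if hook is None:
            continue
        result = await _maybe_await(hook(*args))
        if isinstance(result, Response):
            return result
    return None


async def download(request, middlewares, fetch):
    try:
        # A Response from process_request short-circuits the real download.
        response = await _first_response(middlewares, "process_request", request)
        if response is None:
            response = await fetch(request)
    except Exception as exc:
        # process_exception may recover by returning a Response.
        response = await _first_response(
            middlewares, "process_exception", request, exc
        )
        if response is None:
            raise
    for mw in reversed(middlewares):
        hook = getattr(mw, "process_response", None)
        if hook is not None:
            response = await _maybe_await(hook(request, response))
    return response


class BlockAds:
    def process_request(self, request):
        if "ads." in request.url:
            return Response(request.url, 204)


class Fallback:
    async def process_exception(self, request, exc):
        return Response(request.url, 503, b"fallback")


async def failing_fetch(request):
    raise ConnectionError("network down")


mws = [BlockAds(), Fallback()]
blocked = asyncio.run(download(Request("http://ads.example.com/x"), mws, failing_fetch))
recovered = asyncio.run(download(Request("http://example.com/"), mws, failing_fetch))
```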

Docs (spiders middlewares, politeness AutoThrottle, API reference) and
CHANGELOG updated. Tests (201 -> 213): test_middlewares, test_autothrottle.
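
To make the AutoThrottle idea concrete, here is a sketch of one possible delay update, modelled on the rule Scrapy documents for its AutoThrottle extension. Whether crawley uses the same formula is an assumption; `next_delay` and its parameter names are hypothetical.

```python
def next_delay(current_delay, latency, target_concurrency=1.0,
               min_delay=0.0, max_delay=60.0):
    """Return the new per-host delay after observing one response latency."""
    # Delay that would keep about target_concurrency requests in flight.
    target = latency / target_concurrency
    # Move halfway toward the target to smooth out latency spikes...
    new_delay = (current_delay + target) / 2.0
    # ...but never go below the target itself, so slowdowns apply at once.
    new_delay = max(target, new_delay)
    # Clamp to the configured bounds.
    return min(max(new_delay, min_delay), max_delay)


fast_host = next_delay(5.0, latency=1.0, target_concurrency=2.0)
slow_host = next_delay(1.0, latency=4.0)
capped = next_delay(1.0, latency=200.0, max_delay=60.0)
```

With these inputs a fast host eases its delay down gradually, a slow host jumps straight to the target delay, and the result never exceeds `max_delay`.
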
jmg merged commit fa0bf52 into master on Jun 5, 2026
7 of 9 checks passed
2 participants